Contagious statistical distributions: k-connections and applications in infectious disease environments

Contagious statistical distributions are a valuable resource for managing contagion by means of k-connected chains of distributions. Binomial, hypergeometric, Pólya and uniform distributions with the same values for all parameters except the sample size n are known to be strongly associated. This paper describes how this relationship can be obtained via factorial moments, simplifying the process by including novel elements. We describe the properties of these distributions and provide examples of their real-world application, and then define a chain of k-connected distributions, which generalises both the relationship among samples of any size from a given population and the Pólya urn model.


Introduction
A large body of literature has been generated regarding mathematical models of epidemiology. These models usually consider the population under study to be clustered as follows: persons born with passive immunity (denoted by M), those without passive immunity and hence susceptible (S), those who are infected but not infectious (E), those who are capable of transmitting the infection, and hence infectious (I), and those who have a permanent infection-acquired immunity, and hence recovered (R). Different epidemiology models are classified according to which of these clusters are considered: MSEIR, SEIR, SIR, etc.
Among the main parameters included in these models are the basic reproduction number of an epidemic, $R_0$ (that is, the expected number of secondary cases produced by a primary case during their infectious period, within a completely susceptible population), and the degree of herd immunity (the fraction of immune individuals within the population beyond which the epidemic can no longer grow). A brief historical summary of mathematical models in epidemiology can be found in Hethcote [1].
Knowledge and understanding in this area have advanced rapidly, and many empirical studies have improved upon the classical models by including significant features such as the effects of heterogeneity and correlations, household effects, network-driven contagion and mobility models. Sun et al. [2], in a study based on Chinese data, considered the role of transmission heterogeneities, which are driven by demography, behaviour and interventions. Kawagoe et al. [3] examined the question of infectious disease dynamics in heterogeneous populations and the role played by "superspreaders". Aleta et al. [4] studied the effects of testing, quarantine and contact tracing, and Huber et al. [5] proposed a tracing strategy to optimise the cost/effect balance. Chang et al. [6] successfully developed a SEIR model which used mobile phone geolocation data.
Most of the mathematical models proposed require certain assumptions about the dynamics of infectious disease. For instance, a common and sometimes unrealistic assumption is that there is the same probability of any infectious individual infecting any susceptible one, a relation that is termed homogeneity. Britton et al. [7] showed that the contrary situation, that of population heterogeneity, can have a considerable impact on disease-induced immunity because the proportion of infected individuals in groups with higher contact rates is greater than that in groups with lower contact rates. Hébert-Dufresne et al. [8] showed, using random network theory to predict the size of an epidemic, that without data on the heterogeneity in secondary infections (which are needed to estimate its cumulative distribution function) the size of the outbreak remains highly uncertain.
Seeking to avoid the above assumption, various improvements to the models have been suggested. For instance, Neipel et al. [9] generalised the SIR model taking into account generic effects of heterogeneity on the population's degree of susceptibility to infection. Introducing a new parameter, that of a power-law exponent of the susceptibility distribution at small susceptibilities, Neipel et al. showed that the class of gamma distributions acts as an attractor of the dynamics, making it possible to identify generic effects of population heterogeneity.
Another common assumption which may be unrealistic is the "law of large numbers" (LLN), meaning that the population size is large enough to accurately describe random dynamics with asymptotic elements, such as limit probability distributions. However, many situations of infectious disease spread originate within a closed environment (school classrooms, for instance) with a small population size, where the LLN assumption does not hold. In this circumstance, attempting to forecast the behaviour of infection dynamics by means of the classical models would be quite misleading. Brooks et al. [10] developed a stochastic transmission model of infection spread in university campuses, based on realistic mixing patterns, and evaluated various infection mitigation strategies. Mayberry et al. [11] presented dynamic random graph techniques for modelling small population outbreaks, allowing different interaction rates among students. These authors analysed Monte Carlo simulations, assuming a beta negative binomial distribution, to determine the effects of different transmission rates and of diverse vaccination strategies on the dynamics of a hypothetical outbreak of influenza. With respect to the COVID-19 pandemic, several guidelines on appropriate antigen-testing strategies have been developed. For instance, the National Academies of Sciences, Engineering, and Medicine [12] provided a general guide for colleges and universities in the USA, and Nixon et al. [13] described how the University of Bristol (UK) developed CONQUEST, a tool to record and analyse data on COVID-19.
In this paper, we consider contagious statistical distributions and theoretical tools which could be applied to certain scenarios of infection, such as a closed environment with several rooms. In addition, we present techniques showing how complementary information on the statistical behaviour of infectious disease spread can be obtained.
Contagious statistical distributions are valuable tools in the epidemiology of communicable diseases. They enhance our understanding of the presence of contagion in, for example, a confined space. In a more general scenario, assume a system with n components. Each component could have a different workload and hence a different probability of failing. The variable considered is the number of failure events. Now assume a different number of system components and a different workload for each component; nevertheless, the overall proportion of failing components remains unaltered. The question then arises: when $X^{(n)}$ is the number of failures in a system containing n components, what relationship exists (if any) among the probability distributions of $X^{(n)}$ for n = 1, 2, . . .?
Such a scenario is most commonly modelled using a classical binomial distribution, in which each component has the same probability of failing. However, this means there may be a high variance, and therefore a large statistical error in the estimates obtained. A second concern is the implicit assumption that the binomial probability mass function (pmf) resembles a Gaussian curve as n increases, with the corresponding symmetry, unimodality, etc. This assumption is not always valid. Finally, independence cannot always be assumed.
The Pólya urn (contagious) model described by Eggenberger and Pólya [14] models the above situation by considering an urn which initially contains W white balls (cases) and R red ones (others). One ball is sampled at random and returned to the urn with c additional balls of the same colour. After this procedure has been repeated n times, the variable $X^{(n)}$, which counts the number of white balls sampled, is said to be Pólya distributed, and is denoted by $X^{(n)} \sim P(W, R, c, n)$. Its pmf, i.e. the probability that, after n draws, w white balls (representing cases of infection) and n − w = r red balls (representing individuals free of infection) have been drawn, is given by
$$\Pr\left(X^{(n)} = w\right) = \binom{n}{w} \frac{\prod_{i=0}^{w-1}(p + i\delta)\,\prod_{j=0}^{r-1}(q + j\delta)}{\prod_{k=0}^{n-1}(1 + k\delta)},$$
where p = W/(W + R), q = 1 − p, and δ = c/(W + R), subject to the following feasibility conditions:
. . ., $p_n$), and its pmf is given by
$$\Pr\left(X^{(n)} = k\right) = \prod_{i=1}^{n}(1 - p_i) \sum_{\{i_1, \dots, i_k\} \subseteq \{1, \dots, n\}} \; \prod_{j=1}^{k} \mathrm{logit}(p_{i_j}),$$
where $\mathrm{logit}(p_i) = p_i/(1 - p_i)$, for i = 1, . . ., n, and the summation is over all possible combinations of different $i_1, \dots, i_k$ from {1, . . ., n}. Clearly, the mean of the random variable $X^{(n)}$ is given by
$$E\left[X^{(n)}\right] = \sum_{i=1}^{n} p_i.$$
This main result from [19] is quite surprising, meaning that the number of successes in n dependent Bernoulli trials can be described as the number of successes in independent ones.
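The Pólya and Poisson-Binomial pmfs referred to above can be evaluated numerically. The following is a minimal sketch, assuming the standard product form of the Pólya pmf in terms of p, q and δ, and computing the Poisson-Binomial pmf by adding one independent Bernoulli trial at a time rather than by summing over combinations. The function names `polya_pmf` and `poisson_binomial_pmf` are illustrative and do not come from the paper.

```python
from fractions import Fraction
from math import comb


def polya_pmf(W, R, c, n):
    """pmf of the number of white balls in n draws from a Polya urn.

    Uses p = W/(W+R), q = 1-p, delta = c/(W+R). The caller is responsible
    for choosing feasible parameters (e.g. n not larger than W+R when c = -1).
    """
    p = Fraction(W, W + R)
    q = 1 - p
    delta = Fraction(c, W + R)
    den = Fraction(1)
    for k in range(n):
        den *= 1 + k * delta
    pmf = []
    for w in range(n + 1):
        num = Fraction(1)
        for i in range(w):          # white balls drawn
            num *= p + i * delta
        for j in range(n - w):      # red balls drawn
            num *= q + j * delta
        pmf.append(comb(n, w) * num / den)
    return pmf


def poisson_binomial_pmf(probs):
    """pmf of the number of successes in independent, non-identical Bernoulli trials."""
    pmf = [Fraction(1)]
    for p in probs:
        p = Fraction(p)
        nxt = [Fraction(0)] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)      # trial fails
            nxt[k + 1] += mass * p        # trial succeeds
        pmf = nxt
    return pmf


# c = 0 reproduces the binomial case, c = -1 sampling without replacement.
print(polya_pmf(2, 3, 0, 3))
print(polya_pmf(2, 3, -1, 3))
print(poisson_binomial_pmf([Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]))
```

Exact rational arithmetic is used so that special cases can be compared by equality rather than within a floating-point tolerance.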
The Poisson-Binomial distribution has received relatively little research attention in recent years, mainly due to the absence of assumptions regarding its parameters: the Poisson-Binomial family contains quite different distributions, with quite different properties, and n parameters are required to model a random variable which can take n + 1 values. Among the few more or less recent papers on these distributions, theoretical results have been reported by Schlemm [20] and a goodness-of-fit test was proposed by Acharya and Daskalakis [21]. In addition, some work on approximation, by different methods, has been done by Neammanee [22,23], Barbour [24], Skipper [25], Butler and Stephens [26] and Novak [27]. Studies related to the computation of probabilities include Hong [28] and Barrett and Gray [29]. Analyses in which the model has proven useful are described in Chen and Liu [30], Tejada and den Dekker [31] and Rosenman and Viswanathan [32]. An excellent review of the most recent progress related to the Poisson-Binomial distribution is Tang and Tang [33].
Let us assume that not only Poisson-Binomial models but any finite distribution might best fit the data. If the cdf of each $X^{(n)}$ is denoted by $F^{(n)}$, then we wish to find chains of distributions of the form $\{F^{(n)} : n = 0, \dots, M\}$, where M could be infinity or an integer upper bound to the chain, and where all these distributions share certain regularity conditions. However, these conditions cannot be defined in a simple way.
A chain of finite distributions has a relationship called k-connection, meaning there exists a strong relationship among the respective factorial moments, which can be viewed as a regular pattern of behaviour within a contagious environment. This, in turn, implies the existence of proportionality in the expected means, variances, etc., thus providing us with an instrument to manage the behaviour of the number of infections taking place in an environment as its population increases.
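The proportionality of factorial moments can be illustrated on the binomial family. The sketch below rests on an assumption, since the precise statement is not reproduced in this passage: it takes the proportionality to mean that the k-th factorial moment of $X^{(n)}$ divided by the falling factorial n(n − 1)···(n − k + 1) does not depend on n. The helper names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """Exact pmf of Bin(n, p) as a list indexed by the number of successes."""
    return [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]


def scaled_factorial_moments(pmf):
    """Return E[C(X, k)] / C(n, k) for k = 0..n, where n = len(pmf) - 1.

    Both numerator and denominator carry a factor 1/k!, so the ratio equals
    the k-th factorial moment divided by the falling factorial of n.
    """
    n = len(pmf) - 1
    return [
        sum(f * comb(i, k) for i, f in enumerate(pmf)) / comb(n, k)
        for k in range(n + 1)
    ]


p = Fraction(1, 4)
for n in (2, 3, 5):
    # For every n the leading entries coincide: 1, p, p^2, ...
    print(n, scaled_factorial_moments(binomial_pmf(n, p)))
```

For Bin(n, p) the scaled moments are the powers of p whatever the sample size, which is the kind of regular pattern the text describes.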
By means of this relationship, a model for the number of successes in $n = n_0$ trials is not only described by a given distribution $F^{(n)}$, but also provides a chain of k-connected distributions for any other feasible sample size. When the relationship among the models within a chain is assumed as part of the model, it can be tested or estimated from samples of different sizes.
The aim of this paper is to describe and/or characterise families of discrete distributions parametrised by a sample size. These distributions are used to model contagion via a relationship that we term k-connectedness. We show that this relationship arises in a natural way among many well-known families of discrete distributions, such as the chains $\{\mathrm{Bin}(n, p) : n \ge 0\}$ and $\{P(W, R, -1, n) : n = 0, \dots, M\}$.
What is this relationship useful for? Theorem 1 shows that chains of connected distributions are feasible statistical models for estimating the proportion of infected individuals in a population, given samples of varying sizes from this population. In other words, observations from samples of different sizes can be used jointly to estimate a unique or common value for the probability of success, p. Thus, data from different distributions within a chain of connected distributions can be jointly used for inference. This powerful possibility is proven to be feasible within any given chain of k-connected distributions.
The rest of this paper is organised as follows. Next, we present the theoretical elements considered and describe their main properties: the connecting function of a finite random variable or distribution; the k-connection relationship; the chains of connected distributions; and the chain generating sequence. In addition, we provide a triangular table to represent a chain generating sequence. Finally, some subsets of well-known families of finite distributions are shown to be chains of connected distributions. In the Estimation section, we then illustrate a practical application of these elements, with a real-world example of their use, showing that samples from different distributions belonging to the same chain of k-connected ones can be used jointly for estimation. A simulation study is also performed to rule out the possibility of errors in the estimation process. Finally, we summarise the main conclusions drawn.

Chain of distributions
In this section, we define and study some auxiliary elements to simplify the definition of a chain of k-connected distributions. Instead of addressing this relationship by means of factorial moments, we do so using a characteristic function of the distributions, termed the connecting function. To facilitate the detection and management of a chain of k-connected distributions, we also define the chain generating sequence, i.e. the sequence of real numbers that characterises a given chain of k-connected distributions. Some classical (but previously unknown) chains and their generating sequences are also shown.

Definition 1 Let $X^{(n)}$ be a random variable with support in the integer interval [0, n]. The function
is then termed the connecting function of $X^{(n)}$. Expression (3) can be rewritten in terms of the probability generating function (pgf)

Proposition 1
The connecting function of a Poisson-Binomial distributed variable, $X^{(n)} \sim PB(p_1, \dots, p_n)$, is given by
Proof. Given that the well-known Poisson-Binomial probability generating function can be expressed as the proof follows immediately from (4).

Corollary 1
The connecting function of a random variable $X^{(n)}$ with support on the integer interval [0, n] is a polynomial with real roots iff $X^{(n)}$ is a Poisson-Binomial distributed variable.
Proof. To prove this, it only has to be noticed that any real root of $C(z)$ is inside the real
The connecting function is no more than a particular probability generating function. Nevertheless, it is a useful means of presenting the natural concept of a chain of connected distributions, which to our knowledge has not been addressed before. With this in mind, we first introduce the concept of k-connection and then go on to prove that it is the common internal relationship of certain particular sets of discrete probability distributions.
Definition 2 Let $X^{(n)}$ and $X^{(n+k)}$ be random variables with respective connecting functions $C_n(z)$ and $C_{n+k}(z)$. Both variables and their respective distributions are said to be k-connected if
When a pair of random variables, $X^{(n)}$ and $X^{(n+1)}$, are 1-connected, they are said to be connected.
For instance, in the binomial distributions Bin(n, p) and Bin(n + 1, p), we have that $C_{n+1}(z) = (z - p)^{n+1}$, and so Bin(n, p) and Bin(n + 1, p) are 1-connected. Analogously, Bin(n, p) and Bin(n + 2, p) are 2-connected distributions, and so on. The same outcomes are obtained in most classical finite models, such as Pólya distributions and discrete uniform distributions.
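The binomial example can be checked with exact polynomial arithmetic. The sketch below makes an assumption about the form of the connecting function, since expression (3) is not reproduced in this passage: it takes $C_n(z) = \sum_i \Pr(X^{(n)} = i)\,(z-1)^i z^{n-i}$, which reproduces $(z - p)^n$ for Bin(n, p). Connection is then tested through the derivative identity $\frac{d}{dz} C_{n+1}(z) = (n+1)\,C_n(z)$ used later in the text. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def poly_pow(base, e):
    """Raise a polynomial to a non-negative integer power."""
    out = [Fraction(1)]
    for _ in range(e):
        out = poly_mul(out, base)
    return out


def connecting_poly(pmf):
    """ASSUMED form: C_n(z) = sum_i f_i (z - 1)^i z^(n - i), with n = len(pmf) - 1."""
    n = len(pmf) - 1
    total = [Fraction(0)] * (n + 1)
    for i, f in enumerate(pmf):
        term = poly_mul(poly_pow([Fraction(-1), Fraction(1)], i),
                        poly_pow([Fraction(0), Fraction(1)], n - i))
        for d, coef in enumerate(term):
            total[d] += f * coef
    return total


def derivative(poly):
    """Coefficient list of the derivative of a polynomial."""
    return [d * c for d, c in enumerate(poly)][1:]


def binomial_pmf(n, p):
    return [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]


p = Fraction(1, 3)
n = 3
lhs = derivative(connecting_poly(binomial_pmf(n + 1, p)))
rhs = [(n + 1) * c for c in connecting_poly(binomial_pmf(n, p))]
print(lhs == rhs)  # the two coefficient lists are compared exactly
```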
In the following, we use the standard Pochhammer notation for the falling and rising factorials:
The following properties are straightforwardly proven.
Proposition 2 Let $X^{(n)}$ and $X^{(n+1)}$ be connected random variables with respective connecting functions $C_n(z)$ and $C_{n+1}(z)$. Let $h \in \{n, n+1\}$. Then: The connecting function can also be written as
Proof. Parts 1 and 2 are straightforward from (3). To prove 3, denote by and taking into account that $\frac{d}{dz} C_{n+1}(z) = (n+1)\,C_n(z)$. The remaining properties are obtained immediately.
It is obvious that if $X^{(n)}$, $X^{(n+k)}$ are k-connected and $X^{(n+k)}$, $X^{(n+k+h)}$ are h-connected, then $X^{(n)}$, $X^{(n+k+h)}$ are (k + h)-connected. Accordingly, a sequence of consecutively connected variables can be considered, in which any pair are k-connected for the appropriate k.
Notice that item 3 in Proposition 2 gives a recurrence relationship among the distributions. This relationship is verified by the well-known subfamilies of discrete distributions which are applied to n-sampling from a given population.
Definition 3 A set of random variables $X^{(0)}, X^{(1)}, \dots$ such that any pair of them are k-connected for the appropriate k is said to be a chain of connected distributions.
A chain of connected distributions can be finite or infinite, depending on its nature, and its first element is a degenerate random variable which takes a null value with full probability. In an example below, we demonstrate that binomial chains contain one distribution for each sample size. However, a chain of hypergeometric distributions only contains a finite number of distributions, as the size of a sample without replacement cannot exceed the population size. Given a discrete distribution $F^{(n)}$ on {0, . . ., n}, it is easily proven that there exists a chain of connected distributions $\{F^{(k)} : k = 0, \dots, n\}$ which contains $F^{(n)}$. The question to be solved is whether there exists an additional distribution $F^{(n+1)}$ that would extend the chain.
Any chain of connected distributions is characterised by a sequence of real numbers, from which the chain can be extended when this is possible and which makes it apparent when it is not. These sequences are termed chain generating sequences. In this case, the finite difference operator of a sequence of numbers is denoted as
The following properties of this operator are evident:

Lemma 1 Given a sequence of real numbers A = {a
Definition 4 Let $A = \{a_k : k = 0, \dots, n\}$ be a sequence of real numbers that verifies:
Then, A is termed a chain generating sequence (cgs). It can be easily proven that each element of a cgs lies within the real interval [0, 1], and that $a_k \ge a_{k+1}$, for any k = 0, . . ., n − 1.
Proof. Following part b in 4 from Proposition 2, we obtain one direction of the iff. To prove the opposite direction, and for $h \in \{n-1, n\}$, the polynomial expression of $C_h(z)$ is given by
After identifying the terms in the polynomial expression of $C_{n-1}$ we obtain that $\frac{d C_n(z)}{dz} = n\,C_{n-1}(z)$ is equivalent to verifying (7), and so the proof is complete.
Then, the set of random variables $\{X^{(k)} : k = 0, \dots, N\}$ such that is a chain of connected distributions, which we term the chain of connected distributions generated from A.
Proof. Proceed recursively. For k = 1, $f_{0,0} = a_0 = 1$, and $f_{1,0} = \Delta^1 a_0 = 1 - a_1 \ge 0$, $f_{1,1} = \Delta^0 a_1 = a_1$. Notice that $f_{0,0} = (f_{1,0} + f_{1,1})/1$, and apply Lemma 2. Now, if the result is true for k − 1, we search for a pmf $f_k = (f_{k,0}, \dots, f_{k,k})$ which is connected to $f_{k-1} = (f_{k-1,0}, \dots, f_{k-1,k-1})$. From Lemma 2, the following linear system must be solved:
Notice that the condition $\sum_{i=0}^{k} f_{k,i} = 1$ is redundant, and that the system has infinitely many solutions. A particular solution can be found by first taking $f_{k,k} = (-1)^0 \Delta^0 a_k = a_k$. After some straightforward, if tedious, calculations, the proof is complete.
Given any finite distribution $F^{(n)}$ within the integer interval [0, n] it is simple to obtain a chain that contains $F^{(n)}$, and to determine whether another distribution $F^{(n+1)}$ could be added to the chain. The necessary procedure, which somewhat resembles Pascal's triangle, only requires the use of (8) and (5), as shown in the following example.
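The construction of a pmf from a chain generating sequence can be sketched in a few lines. Expression (8) is not reproduced in this passage, so the formula used here is an assumption: $f_{k,i} = \binom{k}{i} \sum_{j=0}^{k-i} (-1)^j \binom{k-i}{j} a_{i+j}$, which agrees with the values $f_{1,0} = 1 - a_1$, $f_{1,1} = a_1$ and $f_{k,k} = a_k$ quoted in the proof. The name `pmf_from_cgs` is hypothetical.

```python
from fractions import Fraction
from math import comb


def pmf_from_cgs(a, k):
    """pmf (f_{k,0}, ..., f_{k,k}) of X^(k) generated by a_0, ..., a_k.

    ASSUMED form of the construction:
        f_{k,i} = C(k, i) * sum_j (-1)^j C(k-i, j) a_{i+j},  j = 0..k-i.
    """
    return [
        comb(k, i) * sum((-1) ** j * comb(k - i, j) * a[i + j] for j in range(k - i + 1))
        for i in range(k + 1)
    ]


# With a_m = p^m the generated chain is the binomial one.
p = Fraction(1, 3)
a = [p**m for m in range(5)]
for k in range(5):
    print(k, pmf_from_cgs(a, k))
```

Each row printed is one distribution of the chain, so the output has the triangular shape mentioned in the text.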
Example 2 Let $X^{(2)}$ be a random variable with a pmf given by Thus, Now, using (5) and from bottom to top, we can easily derive the following triangle:
From this triangle, the pmfs of $X^{(k)}$, k = 0, 1, are also found, again from (8):
We now wish to find a feasible value for $a_3 = x$. In order to preserve the condition of cgs, the entries in the additional row must maintain the sign of each column:
To conclude, x = 0 leads to $f_3 = \left(0, \tfrac{1}{2}, \tfrac{1}{2}, 0\right)$, which is a uniform distribution on {1, 2}.
For any unknown cgs $\{1, a_1, \dots, a_n\}$, the entries of a generic triangle table are easily found as linear functions of the $a_i$, by using (6) and (8). Moreover, it is easy to obtain cases where a cgs can be enlarged with a unique feasible additional number, or with any number within a range of values, or cannot be enlarged at all, as the last row gives a system of inequalities, which can have just one solution, no solution or an infinite number of solutions.
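The search for a feasible $a_3$ can be sketched as follows. Two assumptions are made and should be read as such. First, the pmf of $X^{(2)}$ is not reproduced in this passage; the value (1/6, 2/3, 1/6) used below is back-derived from the stated result $f_3 = (0, 1/2, 1/2, 0)$ and is not quoted from the paper. Second, the cgs is taken to be $a_k = E\left[\binom{X}{k}\right] / \binom{n}{k}$ and the pmf to be recovered by the alternating-sum formula in the code. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def cgs_from_pmf(pmf):
    """ASSUMED definition: a_k = E[C(X, k)] / C(n, k), with n = len(pmf) - 1."""
    n = len(pmf) - 1
    return [sum(f * comb(i, k) for i, f in enumerate(pmf)) / comb(n, k)
            for k in range(n + 1)]


def pmf_from_cgs(a, k):
    """ASSUMED form: f_{k,i} = C(k,i) * sum_j (-1)^j C(k-i,j) a_{i+j}."""
    return [
        comb(k, i) * sum((-1) ** j * comb(k - i, j) * a[i + j] for j in range(k - i + 1))
        for i in range(k + 1)
    ]


def extension_interval(a):
    """Interval (lo, hi) of values x for which a + [x] still yields a pmf, or None."""
    n1 = len(a)                                   # index of the new element
    base = pmf_from_cgs(a + [Fraction(0)], n1)    # new row evaluated at x = 0
    lo, hi = Fraction(0), Fraction(1)
    for i, c in enumerate(base):
        slope = comb(n1, i) * (-1) ** (n1 - i)    # each entry is c + slope * x
        bound = -c / slope
        if slope > 0:
            lo = max(lo, bound)
        else:
            hi = min(hi, bound)
    return (lo, hi) if hi >= lo else None


pmf2 = [Fraction(1, 6), Fraction(2, 3), Fraction(1, 6)]   # back-derived, see lead-in
a = cgs_from_pmf(pmf2)
print(a)
print(extension_interval(a))
```

The three outcomes mentioned in the text correspond to an interval with equal endpoints, an interval with distinct endpoints, and `None`.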
The nature of the k-connection among distributions is illustrated by the following set of results. The proof of each one reduces to simple checking. The parameter notation is shown in the Introduction section.
Proof. The proof is immediate from (8).
The latter result has an interesting meaning, namely that in sampling without replacement, higher values of n lead to null probabilities for some extreme values of $X^{(n)}$, meaning that its support set is not actually the integer interval [0, n], but a subset within it.
Proof. The proof is immediate from (8).

Proposition 6
The sequence $a_n = (n + 1)^{-1}$, for $n \ge 0$, is a cgs and the chain of distributions generated is the family of discrete uniform distributions in the integer intervals [0, n].
Proof. The proof is immediate from (8).
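Proposition 6 can be checked numerically, together with a sampling-without-replacement chain. This is a sketch under assumptions: the pmf is rebuilt from the sequence with the alternating-sum formula in the code (the paper's expression (8) is not reproduced here), and the hypergeometric sequence $a_k = W(W-1)\cdots(W-k+1) / \left(N(N-1)\cdots(N-k+1)\right)$ is derived from the standard factorial moments of that distribution rather than quoted from the paper. The population values N = 10 and W = 4 are hypothetical.

```python
from fractions import Fraction
from math import comb


def pmf_from_cgs(a, k):
    """ASSUMED form: f_{k,i} = C(k,i) * sum_j (-1)^j C(k-i,j) a_{i+j}."""
    return [
        comb(k, i) * sum((-1) ** j * comb(k - i, j) * a[i + j] for j in range(k - i + 1))
        for i in range(k + 1)
    ]


def falling(x, k):
    """Falling factorial x (x - 1) ... (x - k + 1)."""
    out = 1
    for j in range(k):
        out *= x - j
    return out


# Proposition 6: a_n = 1/(n + 1) generates discrete uniform distributions.
uniform_cgs = [Fraction(1, n + 1) for n in range(6)]
uniform_5 = pmf_from_cgs(uniform_cgs, 5)

# Sampling without replacement, hypothetical population N = 10 with W = 4 cases.
N, W = 10, 4
hyper_cgs = [Fraction(falling(W, k), falling(N, k)) for k in range(7)]
hyper_6 = pmf_from_cgs(hyper_cgs, 6)
hyper_exact = [Fraction(comb(W, i) * comb(N - W, 6 - i), comb(N, 6)) for i in range(7)]

print(uniform_5)
print(hyper_6)   # the entries for 5 and 6 cases vanish, since W = 4
```

The vanishing entries in the last line illustrate the remark that, without replacement, the support is a proper subset of [0, n] for larger n.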
From the previous results, it seems that the k-connection of finite distributions is an essential relationship, which is present in a natural way, although largely unnoticed. Accordingly, it seems credible that certain apparently unrelated distributions may actually present a similar relationship, for example, Poisson-Binomial distributions with no common individual probabilities of success. On the other hand, any more or less arbitrary finite distribution can be connected to a chain.

Estimation
Consider the following scenario. In a given country or city, the presence of infectious disease is noticed, and planners wish to model the number of contagious persons in a classroom, waiting room or similar. An initial approach to this task might be to create a binomial model, whereby the parameter to be estimated would be the proportion of contagious persons, p, within the total population in the environment. This parameter could be estimated from samples of rooms containing any number of persons. If any model other than a binomial one were considered, the existence of different-sized rooms (i.e. capacities) could make it difficult or even impossible to conduct a joint estimation.
On the one hand, the possibility of making joint use of different room capacities is helpful, as the number of persons within each room is another random variable in itself. But on the other hand, the binomial model requires the (uncomfortable) condition of independence among the numbers present in each room.
Given these circumstances, a helpful, relaxed condition could be to consider that the numbers of contagious persons in rooms of any size are k-connected random variables. The only implications of this assumption are the regularity conditions given by the proportionality of the factorial moments for each room size. The advantage of this assumption is that it reduces the problem of estimation to that of finding a cgs, $\{a_n : n = 0, \dots, M\}$, and making joint use of all data, with no constraints on room sizes.
Consider the following notation. There are N rooms; inside each room $k_j$ people are meeting, where j = 1, . . ., N and each $k_j \in \{1, \dots, M\}$, and where M is the highest number of persons observed in a room. The number of persons infected after each meeting is given by the number of cases where $X^{(k)} = i$, that is, the number of rooms with k persons attending and where i of them are infected after their meeting. We also denote that is, the number of rooms with k persons present. Then, $N = \sum_{k=1}^{M} d_k$. The problem to be addressed is then to estimate To do so, estimates $\hat{f}_k$ for $f_k$ can be found by solving the following program: In this case, the best results are obtained by the quadratic norm $\|x\| = x^2$.
Then, by using (6) and (8), each $\hat{f}_{k,i}$ can be written as a linear function of $\{a_k : k = 0, \dots, M\}$. These are well-known, and no comments are needed in this paper about the convergence $\|d(k, i) - d_k \cdot \hat{f}_{k,i}\| \to 0$ as $d_k \to \infty$ (from the law of large numbers) or the chi-square goodness-of-fit test for each pmf, $f_k$.
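Because each $\hat{f}_{k,i}$ is linear in the sequence, the quadratic-norm fit reduces to a linear least-squares problem. The sketch below is a simplification rather than the program stated in the paper: it minimises $\sum_{k,i} \left(d(k,i) - d_k f_{k,i}\right)^2$ over $a_1, \dots, a_M$ with $a_0 = 1$, ignores the inequality constraints that make the result a valid cgs, and assumes the linear form of $f_{k,i}$ written in `design_row`. The counts are hypothetical and all names are illustrative.

```python
from fractions import Fraction
from math import comb


def design_row(k, i, M):
    """Coefficients of (a_0, ..., a_M) in f_{k,i} (ASSUMED linear form)."""
    row = [Fraction(0)] * (M + 1)
    for m in range(i, k + 1):
        row[m] = Fraction(comb(k, i) * (-1) ** (m - i) * comb(k - i, m - i))
    return row


def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Raises StopIteration if the system is singular (too little data).
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def fit_cgs(counts, M):
    """counts[(k, i)] = number of rooms with k attendees of whom i are infected.

    Returns the unconstrained least-squares estimate [1, a_1, ..., a_M].
    """
    d = {}
    for (k, i), c in counts.items():
        d[k] = d.get(k, 0) + c                     # d_k: rooms with k attendees
    rows, targets = [], []
    for k, dk in d.items():
        for i in range(k + 1):
            full = design_row(k, i, M)
            rows.append([dk * v for v in full[1:]])
            targets.append(Fraction(counts.get((k, i), 0)) - dk * full[0])
    # Normal equations (R^T R) a = R^T t.
    AtA = [[sum(r[p] * r[q] for r in rows) for q in range(M)] for p in range(M)]
    Atb = [sum(r[p] * t for r, t in zip(rows, targets)) for p in range(M)]
    return [Fraction(1)] + solve(AtA, Atb)


# Hypothetical counts for rooms of two and three attendees.
counts = {(2, 0): 1, (2, 1): 2, (2, 2): 1,
          (3, 0): 1, (3, 1): 3, (3, 2): 3, (3, 3): 1}
print(fit_cgs(counts, 3))
```

A constrained solver would be needed whenever the unconstrained estimate falls outside the set of valid chain generating sequences; that step is deliberately left out of this sketch.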
The following example illustrates how even with a sparse dataset, estimation is feasible.

Example 3
Assume an infectious disease outbreak and the known existence of meetings at which some of those present have been infected. At each meeting, there are three, four or five attendees. Consider eight samples, as shown in Table 1: Here, the values $X_j^{(k_j)}$ (Infected) from rooms with the same number of attendees, $k_j$ = 3, 4, 5 (Attendees), are shown in each row. For rooms with three attendees, only one meeting concluded with no persons infected; two meetings concluded with one infected, three with two infected and two with three infected.

A simulation experiment
Suppose that, in a contagion situation within several rooms, as in the Estimation section, the probability distributions of $X^{(k)}$ (number of infected persons after a meeting with k attendees) are k-connected hypergeometric, meaning that X
1000 simulations of each case were performed and the results obtained are shown in Tables 3 to 8. In each table, the exact values for the cgs ($a_i$) and the last pmf, $\Pr(X^{(20)} = i)$, are shown beside the respective mean squared error (mse) for the estimations. In Tables 3 and 4, the simulated data correspond to H(100, 20, n) simulations; in Tables 5 and 6, data belong to H(100, 50, n); and in Tables 7 and 8 the data correspond to H(100, 80, n) estimations, where each pair of tables corresponds to Scenario 1 and 2, respectively.
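A small experiment in the same spirit can be sketched as follows. It is not the simulation design of the paper, whose scenarios are not reproduced in this passage. It assumes that H(100, 20, n) denotes n people sampled without replacement from a population of 100 containing 20 infected, draws hypothetical room sizes between 5 and 20, and replaces the paper's estimation program with a simple pooled moment-type estimator, $\hat{a}_j = \sum \binom{x}{j} / \sum \binom{k}{j}$, where both sums run over the rooms. The seed and the number of rooms are arbitrary.

```python
import random
from math import comb


def falling(x, k):
    """Falling factorial x (x - 1) ... (x - k + 1)."""
    out = 1
    for j in range(k):
        out *= x - j
    return out


def simulate_rooms(N, W, sizes, rng):
    """For each room size k, count the infected among k people drawn without replacement."""
    population = [1] * W + [0] * (N - W)
    return [(k, sum(rng.sample(population, k))) for k in sizes]


def moment_estimate(rooms, M):
    """Pooled moment-type estimate of a_0..a_M (a swapped-in, simpler estimator)."""
    est = []
    for j in range(M + 1):
        num = sum(comb(x, j) for k, x in rooms)
        den = sum(comb(k, j) for k, x in rooms)
        est.append(num / den)
    return est


rng = random.Random(12345)                          # arbitrary seed
sizes = [rng.randint(5, 20) for _ in range(2000)]   # hypothetical room sizes
rooms = simulate_rooms(100, 20, sizes, rng)
estimate = moment_estimate(rooms, 3)
# Assumed exact values for sampling without replacement: W^(j) / N^(j).
exact = [falling(20, j) / falling(100, j) for j in range(4)]

for j in range(4):
    print(j, round(estimate[j], 4), round(exact[j], 4))
```

Repeating the run over many seeds and averaging the squared differences would give mean squared errors comparable in kind, though not in value, to those reported in the tables.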

Conclusions
In the epidemiology of communicable diseases, it is essential to analyse and control the spread of infection, which can provoke severe problems not only for public health but in many other areas (for example, a major outbreak may force schools and universities to close). In this respect, statistical procedures such as contagious statistical distributions can be a useful means of studying and controlling the situation. In this paper, we describe the development of procedures directly linked to the modelling of contagion in a closed environment through k-connected chains of distributions. A chain of k-connected distributions contains a single probability distribution for each integer interval [0, n] as its support set, where n = 0, . . ., N, and where N could be infinity. These distributions are closely related, as verified by various well-known models for sampling within a given population and for other families of finite distributions.
Conversely, any probability distribution with a support set in the integer interval [0, N] belongs to a chain, which could be enlarged with other distributions with support sets [0, N + 1], [0, N + 2], . . .
A major application of this result is as a means of estimating the probabilities of a set of distributions from sparse data, as we show in an example. This example also illustrates how the approach described can be used to obtain a contagious model which contains the Pólya distribution as a particular case. When the hypothesis of k-connection among a set of finite distributions is accepted, the pmf of each one can be estimated with data from some of them. Therefore, the k-connection might be considered not only a generalisation of the relationship among sampling distributions from a given population, in which different sample sizes can be jointly used to estimate the common probability of success, p, but also a generalisation of the Pólya contagious model, as both can be obtained as particular cases of k-connected chains.